[SPARK-30761][SQL] Nested column pruning should not prune on required child outputs in Generate #27503
viirya wants to merge 1 commit into apache:master
val requiredAttrs = AttributeSet(g.requiredChildOutput)
NestedColumnAliasing.getAliasSubMap(g.generator.children, requiredAttrs).map {
This case should normally be handled by the case pattern above (Project + Generate). But if all nested fields are selected in the top Project, the above case won't prune. Then, when the optimizer transforms down to the underlying Generate, only the referenced nested columns are kept and the others are pruned from the child. This leaves the accessors in the top Project unresolved.
testSchemaPruning("select explode of nested field of array of struct and " +
  "all remaining nested fields") {
Instead of fixing this case by case, can we try to find all the possible cases and make sure we cover all the possible query plans, including both negative and positive cases?
Also, we need unit test cases for these optimizer rules.
Do we traverse all the ancestor nodes?
@gatorsmile I'm ok to revert it.
Test build #118085 has finished for PR 27503 at commit
What changes were proposed in this pull request?
We prune nested fields from Generate. If a child output is required by an operator on top of Generate, we should not prune nested fields from it. Otherwise, the accessors in the top operator could become unresolved.
Why are the changes needed?
A required child output is one that is referenced, either as a whole or through its nested fields, by an operator on top of Generate. If the rule prunes other nested fields from it, the accessors in the top operator will be unresolved.
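
To illustrate the rule described above, here is a minimal, self-contained sketch. It is a toy model and not Spark's implementation: `Attr`, `pruneChild`, `generatorRefs`, and `requiredOutput` are hypothetical names invented for this example, and the sample schema is made up. It only shows the decision the fix is meant to enforce, namely that an attribute still required above Generate keeps all of its nested fields.

```scala
object NestedPruningSketch {
  // Toy stand-in for a struct-typed attribute: a name plus its nested field names.
  // This is NOT Spark's Attribute class.
  final case class Attr(name: String, fields: Set[String])

  // Hypothetical helper: decide which nested fields of each child attribute
  // survive pruning below a Generate-like node.
  //   generatorRefs:  nested fields the generator itself reads, keyed by attribute name
  //   requiredOutput: names of child attributes still needed by the operator above
  def pruneChild(
      child: Seq[Attr],
      generatorRefs: Map[String, Set[String]],
      requiredOutput: Set[String]): Seq[Attr] =
    child.map { attr =>
      if (requiredOutput.contains(attr.name)) {
        // Required above Generate: keep every nested field so that
        // accessors in the parent operator can still be resolved.
        attr
      } else {
        // Only the generator reads this attribute: keep just the fields it references.
        // Attributes the generator never references are left unchanged in this toy model.
        val kept = generatorRefs.getOrElse(attr.name, attr.fields)
        attr.copy(fields = attr.fields.intersect(kept))
      }
    }

  // Made-up sample schema for illustration only.
  val child: Seq[Attr] = Seq(
    Attr("contact", Set("id", "name", "friends")),
    Attr("employer", Set("id", "company")))

  val pruned: Seq[Attr] = pruneChild(
    child,
    generatorRefs = Map("contact" -> Set("friends"), "employer" -> Set("company")),
    requiredOutput = Set("contact"))
}
```

In this sketch `contact` is listed in `requiredOutput`, so it is left intact even though the generator only reads `friends`, while `employer` is narrowed to `company`. In the actual patch the required attributes appear to be passed to `NestedColumnAliasing.getAliasSubMap` to exclude them from aliasing; the sketch does not reproduce that API.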
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.